[SPARK-32038][SQL] NormalizeFloatingNumbers should also work on distinct aggregate #28876

viirya · 2020-06-20T07:14:48Z

What changes were proposed in this pull request?

This patch applies NormalizeFloatingNumbers to distinct aggregates to fix a regression in distinct aggregation on NaNs.

Why are the changes needed?

We added the NormalizeFloatingNumbers optimization rule in 3.0.0 to normalize special floating-point numbers (NaN and -0.0). The rule is not applied to distinct aggregates, however, which causes a regression. We need to apply it to distinct aggregates as well to fix this.
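To illustrate why normalization matters here, the following is a minimal standalone sketch, not Spark's implementation. It assumes that distinct keys end up being compared by their raw binary representation; the object `NormalizeFloatsSketch` and its helpers `normalize` and `bits` are hypothetical names introduced only for this example.

```scala
object NormalizeFloatsSketch {
  // Hypothetical helper: collapse every NaN to the canonical NaN and -0.0 to 0.0.
  def normalize(d: Double): Double =
    if (d.isNaN) Double.NaN
    else if (d == 0.0d) 0.0d // -0.0 == 0.0 is true, so -0.0 is mapped to 0.0
    else d

  // Raw bit pattern of a double, standing in for a binary key comparison.
  def bits(d: Double): Long = java.lang.Double.doubleToRawLongBits(d)

  // A second quiet NaN whose payload differs from the canonical NaN.
  val otherNaN: Double = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)

  val values: Seq[Double] = Seq(Double.NaN, otherNaN, 0.0d, -0.0d, 1.0d)

  // Distinct count when keys are compared by raw bits, without normalization.
  val rawDistinct: Int = values.map(bits).distinct.size

  // Distinct count when values are normalized before taking their bits.
  val normalizedDistinct: Int = values.map(normalize).map(bits).distinct.size

  def main(args: Array[String]): Unit = {
    println(s"distinct on raw bits: $rawDistinct")
    println(s"distinct after normalization: $normalizedDistinct")
  }
}
```

Without normalization, the two NaNs and the two zeros count as separate keys in this sketch; after normalization each pair collapses into a single key.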

Does this PR introduce any user-facing change?

Yes. It fixes a regression in distinct aggregation on NaNs.

How was this patch tested?

Added a unit test.

viirya · 2020-06-20T16:01:33Z

cc @cloud-fan @dongjoon-hyun

abellina · 2020-06-20T20:58:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala

-    val distinctAttributes = namedDistinctExpressions.map(_.toAttribute)
+    // Ideally this should be done in `NormalizeFloatingNumbers`, but we do it here because
+    // `groupingExpressions` is not extracted during logical phase.
+    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>


Suggested change

-    val normalizednamedDistinctExpressions = namedDistinctExpressions.map { e =>
+    val normalizedNamedDistinctExpressions = namedDistinctExpressions.map { e =>

I have a basic Catalyst question, and feel free to send me away. What about being "named" is a requirement in this case? I bet it has to do with expression binding, but I am not entirely sure, and I was wondering if you had the answer since you had to special-case it here.

If you are asking why we need named expressions here, I think it is because these distinct expressions must appear in the result expressions of the Aggregate physical operator. The result expressions produce the output attributes.

Thanks for taking the time, @viirya. I am not 100% sure about all the cases that need named expressions, but it makes sense to me that a physical node's output expressions need to be named. It seems any downstream node that refers to an output needs something like an ExprId in order to distinguish fields.

abellina · 2020-06-20T21:11:45Z

@viirya thanks for looking at this issue.

dongjoon-hyun · 2020-06-20T21:42:13Z

Thank you for pinging me, @viirya .

dongjoon-hyun · 2020-06-20T21:47:05Z

cc @gatorsmile too since this is a correctness issue at 3.0.0. I believe we need to include this in 3.0.1.
cc @HyukjinKwon since he is interested in 3.0.1 release.

viirya · 2020-06-20T22:42:24Z

@abellina @dongjoon-hyun Thanks for the comments.

HyukjinKwon · 2020-06-21T01:40:15Z

Looks right to me

abellina

LGTM

HyukjinKwon · 2020-06-22T07:06:57Z

retest this please

dongjoon-hyun

+1, LGTM. Thank you so much, @viirya .

SparkQA · 2020-06-22T11:56:01Z

Test build #124354 has finished for PR 28876 at commit bc159ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…nct aggregate

### What changes were proposed in this pull request?

This patch applies `NormalizeFloatingNumbers` to distinct aggregate to fix a regression of distinct aggregate on NaNs.

### Why are the changes needed?

We added `NormalizeFloatingNumbers` optimization rule in 3.0.0 to normalize special floating numbers (NaN and -0.0). But it is missing in distinct aggregate so causes a regression. We need to apply this rule on distinct aggregate to fix it.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a regression of distinct aggregate on NaNs.

### How was this patch tested?

Added unit test.

Closes #28876 from viirya/SPARK-32038.

Authored-by: Liang-Chi Hsieh
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 2e4557f)
Signed-off-by: Dongjoon Hyun

dongjoon-hyun · 2020-06-22T11:59:02Z

Thank you all. Merged to master/3.0.

maropu · 2020-06-23T07:38:33Z

late LGTM, thanks, @viirya

SPARK-32038 reports a regression in Apache Spark 3.0.0: NaN and zero float values are not normalized during DISTINCT aggregations. This causes a mismatch in results between Apache Spark 3.0.0 on CPU and the Rapids Accelerator (which returns the right results). SPARK-32038 was fixed in apache/spark#28876. This commit introduces a conditional xfail test that passes on Apache Spark 3.0.1 and 3.1+ (which include the fix for SPARK-32038) but produces an expected failure on Spark 3.0.0.

NormalizeFloatingNumbers should also work on distinct aggregate.

d11001c

probot-autolabeler bot added the SQL label Jun 20, 2020

abellina suggested changes Jun 20, 2020

abellina mentioned this pull request Jun 20, 2020

[BUG] count(distinct float_col) produces different results from CPU, for Float columns with NaNs NVIDIA/spark-rapids#194

Closed

Address comments.

dd66f7e

HyukjinKwon reviewed Jun 21, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala (outdated, resolved)

HyukjinKwon reviewed Jun 21, 2020

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (outdated, resolved)

Remove unnecessary space.

6764a36

abellina approved these changes Jun 21, 2020

maropu reviewed Jun 21, 2020

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (outdated, resolved)

maropu approved these changes Jun 21, 2020

Rename variables in test.

0026e14

dongjoon-hyun reviewed Jun 21, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala (outdated, resolved)

dongjoon-hyun reviewed Jun 22, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala (outdated, resolved)

For comment.

e5211c3

HyukjinKwon reviewed Jun 22, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala (outdated, resolved)

Remove unused import.

bc159ab

cloud-fan approved these changes Jun 22, 2020

HyukjinKwon approved these changes Jun 22, 2020

dongjoon-hyun approved these changes Jun 22, 2020

dongjoon-hyun closed this in 2e4557f Jun 22, 2020

mythrocks mentioned this pull request Jun 23, 2020

Add conditional xfail test for DISTINCT aggregates with NaN NVIDIA/spark-rapids#261

Merged

viirya deleted the SPARK-32038 branch December 27, 2023 18:23
